Abstract: Text mining is the process of extracting information
of interest from text. It draws on techniques from
various areas such as Information Retrieval (IR), Natural Language
Processing (NLP), and Information Extraction (IE). In this
study, text mining methods are applied to extract causal relations
from maritime accident investigation reports collected from the
Marine Accident Investigation Branch (MAIB). These causal
relations provide information on various mechanisms behind
accidents, including human and organizational factors relating
to the accident. The objective of this study is to facilitate the
analysis of maritime accident investigation reports by making
the extraction of contributory causes more feasible. A careful
investigation of the contributory causes in the reports provides
an opportunity to improve safety in the future.

Two methods have been employed in this study to extract the
causal relations: 1) the pattern classification method and
2) the connectives method. The former uses naive Bayes and
Support Vector Machines (SVM) as classifiers. The latter simply
searches for the words connecting cause and effect in sentences.

The causal patterns extracted using these two methods are
compared to the manual (human expert) extraction. The pattern
classification method showed a fair performance with
F-measure (average) = 65%, compared with F-measure (average) = 58%
for the connectives method. This study provides evidence
that text mining methods can be employed to extract
causal relations from marine accident investigation reports.


I. Introduction 

There is a growing concern in the maritime industry regarding
human and organizational factors that affect sailing
performance and the overall safety of ship operations
on board [6]. This concern stems from a recent rise in
commercial maritime accidents caused by ill-fated decisions
taken by higher level management. This is further highlighted
by academic research showing direct ties between organizational
factors and the safe performance of ship crews.
However, effective tools or methodologies for identifying
and mitigating potentially harmful human and organizational
factors before they cause an accident are yet to be developed.

The purpose of the present research is to extract causal
patterns from accident investigation reports. These patterns
capture the human and organizational factors affecting safety culture
and can inform the models of safety culture used to design assessment
techniques. A careful investigation of these patterns provides
an opportunity to improve and manage safety in the future [53].
This study aspires to model the causal parameters relating to
accidents.
A. Motivation

During the last century, sea trade has increased due to
technological advancements [27]. Hence, an increasing number
of ships are sailing the world's seas. Modern ships are
getting faster, bigger and more highly automated. Though these
technological advancements are beneficial, they pose
challenges of their own. Accidents at sea still occur, and the
consequences to people, ships or the environment are often greater
than before [26].

These accidents are investigated by a maritime accident
investigation board. The board reports how the accident occurred,
together with the circumstances, causes, consequences and rescue
operations. These reports also provide recommendations for
preventing similar accidents. The reports are long, detailed
and systematic examinations of marine accidents that aim to
determine the causes of each accident.

In this paper, the accident investigation reports are a collection
from the Marine Accident Investigation Branch (MAIB).
MAIB examines and investigates all types of marine accidents
to or on board United Kingdom (UK) ships worldwide, and
other ships in UK territorial waters. The collection includes 11 categories
of reports relating to 'Machinery Failures', 'Fire/Explosion',
'Injury/Fatality', 'Grounding', 'Collision/Contact',
'Flooding/Foundering', 'Listing/Capsize', 'Cargo Handling Failure',
'Weather Damage', 'Hull Defects' and 'Hazardous Incidents'.

Human intervention is required in extracting the causal
patterns from the accident investigation reports, as they are
in text format. The extraction is generally a difficult job, as
it takes a lot of time and humans may not always be able
to extract the interesting information objectively [27]. Hence,
these challenges have been addressed with text mining. As an
example, the role of lack of situation awareness in maritime
accident causation was examined by applying text mining software
to accident reports [17].

B. Previous Studies 

According to [17], causal patterns from the accident investigation
reports provide information on various mechanisms
behind accidents. Unfortunately, in the maritime field, no 
standard reporting formats exist and data collection from the 
textual reports is a laborious task [60]. Text mining provides a 
means for efficient and informative scanning of accident cases 
of interest without reading the actual report. Therefore, text 
mining in this context is seen as a useful tool in understanding 
accidents and their influencing factors. 

[14] applied text mining methods to two text databases,
a road accident description database and a survey database. They
extracted new variables from the unstructured text, which were
later used for predicting the likelihood of attorney involvement
and the severity of claims. Interesting themes were identified
in the responses of the survey data. Thus, useful information
that would not otherwise be available was derived from both
databases using text mining methods. [78] investigated and
validated a novel text mining methodology for occupational
accident analysis and prevention. The author also suggested that
adoption of text mining analysis is probably most feasible for large
organizations that can more easily absorb the labour-intensive
steps required to conduct the most meaningful text mining
analysis of occupational injury data. Another article, [80],
applied a text data mining technique called attribute reduction
to accident reports to extract the most frequent concepts, which
were considered to be the reasons leading to human errors in
ship accidents. [1] developed and evaluated
software using text mining algorithms for encountering marine
hazards. This risk management system covered both
organizational and human errors.

The previous studies suggest that text mining could be 
applied on accident investigation reports. However, application 
of text mining is a complex task as it involves dealing with 
the text data which is unstructured. Hence, there is an urgent 
need for a new generation of computational theories and tools 
to assist humans in extracting useful information (knowledge) 
from the rapidly growing volumes of unstructured accident 
investigation reports. 

C. Research Problem 

Mining maritime accident investigation reports is a new
topic and not much has been covered [29]. It is
still regarded as a challenging area, since the reports are
written in natural language [60]. The latest developments
in Natural Language Processing (NLP) and the availability of
faster computers make it possible to extract more information from
the text. Emphasis should be placed on mining information
from unstructured information sources like accident investigation
reports.

The research problem is formulated as follows: 

• How can causal relations be extracted from maritime 
accident investigation reports? 

The following research questions help in solving the research
problem.

• How are the accident investigation reports written and 
structured? 

• What categories of accident investigation reports should 
be considered? 

• What models and algorithms should be chosen for this 
application? 

• How are these models evaluated? 

These research questions are answered in this
paper. They are intended as support for solving the research
problem. Whilst performing the study, knowledge of
classification techniques is also acquired and documented. The remainder
of this section briefly presents the aim and limitations of the study and
the structure of the paper.

The main objective is to facilitate analysis of maritime
accident investigation reports describing the human and organizational
factors in accidents. These factors are extracted
as causal relations using text mining methods. The study
uses pattern classification and connectives methods to mine
causal relations. In both methods, the F-measure is used
to evaluate performance. Other rule-based techniques,
including extraction of sentences based on syntactic grammars,
are left outside the scope of this study. The main reason is that
these methods use Part-of-Speech (PoS) taggers, and
no PoS tagger gives 100% accuracy [36]. An inaccurate
PoS tag can change the grammar of a causal sentence to that
of a non-causal one.

D. Limitations 

The analysis in this study is limited to mining the causal text
relating to 'Groundings', 'Collisions', 'Machinery Failures'
and 'Fire' related accidents. The scope of the study has also
been limited by focusing only on pattern classification and
connectives methods for extracting the causal relations, to keep
the study to a reasonable size.

There are quite a few challenges when dealing with accident
investigation reports. The reports are written in natural
language with no standard template. Misspellings and abbreviations
are often found. Detection of compound terms such
as "safety culture", "spirit status", etc. is difficult, as the order of
importance is unknown. The contextual meanings of the words
"safety" and "culture" differ significantly, but the term "safety
culture" has a different meaning altogether. Therefore, context
and semantics play an important role in text mining.

E. Outline 

Section II introduces the causal relation extraction methods
employed in this study: 1) the pattern classification
method and 2) the connectives method. The former consists of
naive Bayes and SVM classifiers and the latter uses connecting
words. That section also discusses evaluation techniques
such as the F-measure, k-fold cross validation and parameter
tuning. Section III illustrates data preprocessing techniques
such as tokenization, stop word removal and stemming. It further
discusses the document representation. Section IV presents
the experiments and corresponding results. Finally, Section V
concludes the paper with discussions.

II. Methodologies: Causal Relations Extraction

A causal relation is the relation between an event (the cause) 
and a second event (the effect), where the second event is 
understood as a consequence of the first [23]. In other words, 
cause is the producer and effect is the result [18]. Causal 
relations have been studied in several fields. [73] provides 
an overview of theories within the fields of Philosophy and 
Psychology. This study explores two different methods for
extracting causal relations from maritime accident investigation
reports. They are the pattern classification method and
the connectives method.

A. Pattern Classification Method 

Pattern recognition is a subfield of machine learning whose
purpose is to develop methods that recognize meaningful
patterns in data. Pattern recognition has seen applications
in 1) computational fluid dynamics, for reduced-order
modelling [68], [58], [56], [64]; 2) forensics and biometrics,
for detecting spoof images/videos [63], [22], [65]; 3) healthcare
applications [62], [61], [45], [57]; and 4) NLP [60], [59], [43],
[49], [28], [47]. Pattern classification, on the other hand, is a
subset of pattern recognition which is based on the classification
of features. In other words, pattern classification observes the
environment to learn and distinguish patterns of interest and
make reasonable decisions about the pattern (that is, finding the
correct class represented by the pattern) [69]. The decisions of
pattern classifiers depend on the previously available patterns.
The more relevant patterns are available to the pattern
classifier, the better the decision will be.

In machine learning, a pattern is a set of attributes that
represents a data point x. Let us assume $x = \{x_1, x_2, \ldots, x_n\}$ to
be the pattern, with $x_i$, $i = \{1, 2, \ldots, n\}$, being the features of x.
Let us assume that these patterns correspond to P
classes, denoted as $y_i$, $y_i \in \{1, 2, \ldots, P\}$ and $i \in \{1, 2, \ldots, k\}$. The
graphical representation of a basic pattern classifier is shown
in Fig. 1.



Fig. 1. Basic representation of pattern classifier. 


Pattern classification methods are of two types: supervised
methods and unsupervised methods. The major difference
between supervised and unsupervised methods is the process
of learning, during which the characteristics of the data are
learned by the classifier. In supervised classification methods,
the pattern $x = \{x_1, x_2, \ldots, x_n\}$, along with its associated label or
class $y_i$, $y_i \in \{1, 2, \ldots, P\}$ and $i \in \{1, 2, \ldots, k\}$, form the training
dataset $S = \{(x_i, y_i),\ i = \{1, 2, \ldots, k\}\}$. During the training phase,
the classifier learns from the existing patterns with their
corresponding labels. The trained classifier can then be used
to predict the labels for new unseen data or test data. On
the contrary, unsupervised methods do not use the labels $y_i$ along
with the patterns $x_i$ during training. The unsupervised methods
estimate the hidden patterns in the data to group the given data
into several groups or clusters. Hence, unsupervised methods
are also referred to as clustering methods.

This study used two supervised methods, Support Vector
Machines (SVM) and naive Bayes classifiers, to classify causal
and non-causal patterns. Let $x = \{x_1, x_2, \ldots, x_n\}$ denote a causal
or a non-causal pattern, with $x_i$, $i = \{1, 2, \ldots, n\}$, being its Bag
of Words (BoW) features, and let there be k such patterns $x_j$, $j = \{1, 2, \ldots, k\}$. These patterns
correspond to two classes, denoted as $y_j \in \{-1, +1\}$.
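
As a rough illustration of the Bag of Words representation described above, the following sketch maps sentences to count vectors over a shared vocabulary. The function names, the whitespace tokenization and the two toy sentences are assumptions made for this example; they are not taken from the study.

```python
from collections import Counter

def build_vocabulary(sentences):
    """Collect the sorted set of lower-cased tokens over all sentences."""
    return sorted({tok for s in sentences for tok in s.lower().split()})

def bag_of_words(sentence, vocabulary):
    """Return the count vector of a sentence over a fixed vocabulary."""
    counts = Counter(sentence.lower().split())
    return [counts[tok] for tok in vocabulary]

# Hypothetical toy sentences with labels in {-1, +1} (+1 = causal).
sentences = ["engine failed due to overheating", "vessel left port at noon"]
labels = [+1, -1]

vocabulary = build_vocabulary(sentences)
vectors = [bag_of_words(s, vocabulary) for s in sentences]
```

Each entry of a vector corresponds to one feature $x_i$, and each vector, paired with its label, forms one training pattern.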

In the following sub-sections, the classifiers and their evaluation
techniques are discussed. The figures in Section II-A.1
are adapted from "Learning with Kernels" [52] and "Kernel
Methods for Pattern Analysis" [54].

1) Support Vector Machines (SVM): The kernel Support Vector
Machine (SVM) is a widely used pattern classification method
and is well known for accurate and effective pattern classification
[38], [66], [67].



Fig. 2. The optimal separating hyperplane h in linearly separable binary
classification using a support vector machine (SVM). Support vectors are shown
in the highlighted circles that lie on the hyperplanes (dotted lines, $h_1$ and $h_2$)
that have unit distance to the optimal separating hyperplane (solid line, h).


Let $(X, Y)$, $X \subset \mathbb{R}^n$, $Y \in \{-1, +1\}$ denote the training data S in
a two-class classification task. Each point $x \in X$ is associated
with one of the possible classes $Y \in \{-1, +1\}$. The goal of
the SVM is to assign a new data point $x'$ to one of the
possible classes. In probabilistic notation, the likelihood that a
new point $x'$ belongs to a given class, $y' \in \{+1, -1\}$, can be
represented as

$$p(y' = +1 \mid x' = x),$$
$$p(y' = -1 \mid x' = x).$$


Now, the classifier $f : X \to Y$ estimates the representation
of the discriminant function. During training, the function f
has to minimize the probability of misclassification of all data
points in the training data.

SVM solves this problem by finding the function f
which, for every point $(x_i, y_i)$; $x_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^T \in X$, $y_i \in \{-1, +1\}$,
in the training set satisfies

$$f(x_i) > 0 \ \text{if}\ y_i = +1, \qquad f(x_i) < 0 \ \text{if}\ y_i = -1. \tag{1}$$


Eq. (1) is only possible if there exists a hypersurface h which
can separate the data into two classes, either linearly or
non-linearly.

Linearly separable binary classification (maximal margin)

Let us assume that we have a linearly separable training data
set, $S = \{(x_i, y_i)\}$, $i = 1, 2, \ldots, k$, where $x_i$ is any single data
point, $y_i$ is the corresponding class label of $x_i$, and there
are k data points in S. The decision function $\mathrm{sgn}(g(x))$ is equal
to the sign of $g(x)$, where $g(x)$ is any function of x:

$$\mathrm{sgn}(g(x)) = \begin{cases} +1, & g(x) > 0, \\ -1, & g(x) < 0. \end{cases} \tag{2}$$


For the given set of training data S, there exists a linear
discriminant function f of the form

$$f : x \mapsto w^T x + b,$$










Fig. 3. The optimal separating hyperplane h in linearly non-separable
binary classification using a support vector machine (SVM). Support vectors
are shown in the highlighted circles that lie on the hyperplanes (dotted lines,
$h_1$ and $h_2$) that have unit distance to the optimal separating hyperplane (solid
line, h).

where $w \in \mathbb{R}^n$, $b \in \mathbb{R}$ is a constant, and the corresponding
decision function, $t = \mathrm{sgn}(w^T x + b)$, should have zero error.
This means that all k data points in the training data set S
should be classified correctly by the decision function t. There may
exist an infinite number of such hyperplanes (h) that
separate the two classes with zero error. The goal of SVM is to
maximize the minimal distance between the two hyperplanes
($h_1$ and $h_2$) that can separate the data (the minimal margin, as
shown in Fig. 2) of the linear discriminant function f with
respect to the training data set S [25]:

$$\min_{x_i \in X} |w^T x_i + b|.$$

The geometric margin $\gamma$ for the discriminant function is
defined as

$$\gamma = \frac{1}{\|w\|} \min_{x_i \in X} |w^T x_i + b|. \tag{3}$$

From Eq. (3), it is clear that, once the discriminant function is
scaled so that $\min_{x_i \in X} |w^T x_i + b| = 1$, maximizing the minimal
geometric margin reduces to minimizing the squared norm of the weight
vector, $\|w\|^2$. The hyperplane that maximizes the minimum
margin and satisfies

$$y_i (w^T x_i + b) \ge 1, \quad i = \{1, 2, \ldots, k\}, \tag{4}$$

is called the optimal separating hyperplane [25].

From Eq. (3) and Eq. (4), the optimal separating hyperplane
can be represented as follows:

$$\min_{w} \frac{1}{2}\|w\|^2 \quad \text{such that } y_i (w^T x_i + b) \ge 1,\ i = \{1, 2, \ldots, k\}. \tag{5}$$

The above minimization problem can be solved as a dual
optimization problem using the Lagrangian L,

$$L = \min_{w} \max_{\alpha} \left\{ \frac{1}{2}\|w\|^2 - \sum_{i=1}^{k} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right] \right\}, \tag{6}$$

where $\alpha_i$ is the Lagrangian multiplier.


The solution of the above optimization problem, Eq. (6),
defines a linear optimal separating hyperplane given by the
parameters

$$w = \sum_{i:\, \alpha_i > 0} \alpha_i y_i x_i,$$

$$b = -\frac{1}{2} \left[ \min_{y_i = +1} (w^T x_i) + \max_{y_i = -1} (w^T x_i) \right].$$

Training vectors $x_i$ for which $\alpha_i$ is strictly positive are
called support vectors. These support vectors lie on hyperplanes
at unit distance from the optimal separating hyperplane
(as shown in Fig. 2). Using the above optimized w and b, the
classifier t is defined as

$$t(x) = \mathrm{sgn}(w^T x + b).$$

Linearly non-separable binary classification (soft margin)

Perfect linear separability is not realistic. Therefore, we
still need to solve the optimization problem to find the optimal
linear discriminant function when the data are not linearly separable.
Allowing a certain amount of misclassification, and penalizing the
misclassified data points during the optimization, lets us handle such data.
The amount by which the discriminant function fails to reach
the unit margin is termed the error of observation, $\xi_i$ (as
shown in Fig. 3):

$$\xi_i = \max\{0,\ 1 - y_i (w^T x_i + b)\}.$$

Misclassification takes place when $\xi_i > 1$.
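
The following minimal sketch evaluates the linear decision function and the error of observation for a fixed hyperplane. The weight vector, offset and sample points are hypothetical values chosen for illustration, and treating a decision value of exactly zero as class -1 is an assumption, since the sign function above is only defined for non-zero arguments.

```python
def decision_value(w, b, x):
    """Compute w^T x + b for a single point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Sign-based classifier; a value of exactly 0 is mapped to -1 (assumption)."""
    return 1 if decision_value(w, b, x) > 0 else -1

def slack(w, b, x, y):
    """Error of observation: max{0, 1 - y (w^T x + b)}."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))

# Hypothetical hyperplane and labelled points (x, y).
w, b = [1.0, 0.0], -2.0
points = [([4.0, 1.0], +1), ([2.5, 0.0], +1), ([3.0, 2.0], -1)]

slacks = [slack(w, b, x, y) for x, y in points]
```

In this toy setting the first point lies outside the margin (zero slack), the second lies inside the margin but on the correct side (slack between 0 and 1), and the third is misclassified (slack above 1).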

For linearly non-separable data, the optimal separating
hyperplane has to maximize the geometric margin and minimize
the error function

$$\Phi(\xi) = \sum_{i=1}^{k} \xi_i.$$

Considering the error $\xi_i$, the constraint Eq. (4) can be written
as

$$y_i (w^T x_i + b) \ge 1 - \xi_i, \quad i = \{1, 2, \ldots, k\}, \quad \text{and } \xi_i \ge 0.$$

Now the optimization problem (Eq. (5)) can be written as

$$\min_{w} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{k} \xi_i, \quad \text{such that } y_i (w^T x_i + b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0;\ i = \{1, 2, \ldots, k\}, \tag{7}$$

where C is a positive parameter, which defines the importance
of misclassification errors.

To solve the constrained optimization problem (Eq. (7)), we
consider the corresponding dual problem, with the
objective function to be maximized,

$$W(\alpha) = \sum_{i=1}^{k} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{k} \alpha_i \alpha_j y_i y_j (x_i^T x_j),$$

$$0 \le \alpha_i \le C,\ i = \{1, 2, \ldots, k\}; \quad \text{and } \sum_{i=1}^{k} \alpha_i y_i = 0. \tag{8}$$

From the constraint in Eq. (8), $\alpha_i = C$ whenever $\xi_i > 0$, and
the vectors $x_i$ with $\alpha_i > 0$ are called support vectors.


Fig. 4. A non-linear SVM can be interpreted as a linear SVM in a non-linearly
mapped space. $\Phi(\cdot)$ defines the non-linear mapping of data from a
lower dimension to a higher dimension.


"Kernel trick" for non-linear SVM

The "kernel trick" in the context of SVM is a non-linear
transformation that maps the data from a low dimensional space
onto a higher dimensional space (as shown in Fig. 4). By
non-linearly mapping the data onto the higher dimensional space
with an appropriate kernel, it is supposed that the original
linearly non-separable data becomes linearly separable [52], [54].
Since SVM learning needs only the inner product between
data points, the non-linear transformation need not be applied explicitly to
individual points in the training set, thereby maintaining the
efficiency of SVM when using the kernel trick. The kernel-based
SVM often outperforms the original SVM for linearly
non-separable classification tasks [8]. The standard kernels are as
follows:

• Linear Kernel

$$K(x_i, x_j) = \langle x_i, x_j \rangle. \tag{9}$$

• Gaussian Kernel

$$K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right). \tag{10}$$

• Polynomial Kernel

$$K(x_i, x_j) = (\langle x_i, x_j \rangle + c)^d. \tag{11}$$

2) Naive Bayes classification: The naive Bayes classifier
is a supervised learning method based on Bayes' rule of
probability [38]. Naive Bayes classification algorithms are
currently among the most widely used pattern recognition algorithms.
They are popular for their quick training and high accuracy
[38], [34], [4].

According to Bayes' rule, the posterior belief $P(y \mid x)$ is
calculated by multiplying the prior $P(y)$ by the likelihood
$P(x \mid y)$ that x will occur given that y is true, and normalizing by $P(x)$. Bayes' rule
is given by

$$P(y \mid x) = \frac{P(x \mid y) P(y)}{P(x)}. \tag{12}$$

Consider a supervised learning problem, $f : x \to y$. To learn
$P(y \mid x)$, we need to approximate the target function f. Let
us assume $x = (x_1, x_2, \ldots, x_n)$, where $x_i$ is a Boolean random
variable denoting the i-th attribute of x, and y is a Boolean-valued
random variable. Applying Bayes' rule, Eq. (12), $P(y = y_i \mid x = x_k)$
can be represented as

$$P(y = y_i \mid x = x_k) = \frac{P(x = x_k \mid y = y_i) P(y = y_i)}{\sum_j P(x = x_k \mid y = y_j) P(y = y_j)}, \tag{13}$$

where $y_i$ is the i-th possible value of y and $x_k$ is the k-th possible
vector of x.

During learning, $P(x \mid y)$ and $P(y)$ can be estimated using
the training data. Using these estimates, together with Bayes'
rule in Eq. (12), we can determine $P(y \mid x = x_k)$ for any new
data point $x_k$. Bayesian classifiers are computationally very
expensive; however, the conditional independence assumption
of the naive Bayes algorithm drastically reduces the number of
parameters to be estimated when modeling $P(x_k \mid y)$, from
$2(2^n - 1)$ to $2n$.

Conditional Independence: Given random variables x, y, and
z, x is conditionally independent of y given z if
and only if the probability distribution of x is independent of
the value of y given z:

$$(\forall i, j, k)\ P(x = x_i \mid y = y_j, z = z_k) = P(x = x_i \mid z = z_k). \tag{14}$$

The naive Bayes algorithm assumes that the attributes
$x_1, x_2, \ldots, x_n$ are all conditionally independent of one
another given the class y. Under this conditional independence
assumption, we have

$$P(x_1, x_2, \ldots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y). \tag{15}$$

Now, using Bayes' rule and the conditional independence
property (Eq. (14)), the probability that y takes its k-th possible
value given x is

$$P(y = y_k \mid x_1, x_2, \ldots, x_n) = \frac{P(y = y_k) \prod_i P(x_i \mid y = y_k)}{\sum_j P(y = y_j) \prod_i P(x_i \mid y = y_j)}. \tag{16}$$

During training, the distributions $P(y)$ and $P(x_i \mid y)$ are
estimated. Given the attributes of $x'$ (a new data point), the
most probable value of y given $x'$ can be estimated as

$$y \leftarrow \arg\max_{y_k} P(y = y_k) \prod_i P(x_i \mid y = y_k). \tag{17}$$
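
The naive Bayes decision rule for Boolean attributes can be sketched in a few lines. This is an illustrative sketch only: the function names and toy data are hypothetical, the add-one (Laplace) smoothing is an assumption that the text does not mention, and the product of probabilities is evaluated in log space to avoid numerical underflow.

```python
import math

def train_naive_bayes(X, y):
    """Estimate P(y) and P(x_i = 1 | y) from Boolean attribute vectors."""
    classes = sorted(set(y))
    n = len(X[0])
    prior, cond = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        prior[c] = len(rows) / len(X)
        # Add-one (Laplace) smoothing: an assumption, not taken from the text.
        cond[c] = [(sum(r[i] for r in rows) + 1) / (len(rows) + 2)
                   for i in range(n)]
    return prior, cond

def predict(prior, cond, x):
    """Return argmax over classes of P(y) * prod_i P(x_i | y), in log space."""
    def log_score(c):
        s = math.log(prior[c])
        for p, xi in zip(cond[c], x):
            s += math.log(p if xi else 1.0 - p)
        return s
    return max(prior, key=log_score)

# Hypothetical Boolean attribute vectors with labels in {-1, +1}.
X = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
y = [+1, +1, -1, -1]
prior, cond = train_naive_bayes(X, y)
```

With n Boolean attributes and two classes, the sketch stores exactly 2n conditional probabilities, which matches the parameter count given by the conditional independence assumption.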


3) Evaluation: pattern classification method: Machine
learning algorithms induce classifiers that depend on the
training set. So there is a need for evaluation and statistical
testing to assess the expected error rate of a classification
algorithm. Additionally, evaluation is crucial for comparing the
expected error rates of two classification algorithms to identify
the better performing one. Evaluation can also be used as a
guide for future improvements to the model. The technique
here is to generate a test set whose labels are already known.
This test set has to be distinct from the training set that has
been used to train the classifier. The test set is then labelled by
the classifier, and the labels that it assigns are compared
with the correct labels.

Additional techniques have been implemented in order to
get more accurate evaluations and avoid possible over-fitting.
With some parameter changes, the classifier may become more accurate
on the training set and less accurate on the test set.
This is when over-fitting to the training set occurs.

k-Fold Cross Validation: Cross-validation is a method of
evaluating learning algorithms by segmenting the data into
several folds, where each fold serves either as part of the training set or as the validation
set. Each training set is used to train a model, while the
validation set is used for validating the performance of the
trained model. Performance is measured as accuracy averaged
over all folds.

Fig. 5. 10-fold cross-validation procedure. The light-blue folds represent the
validation folds, while the remaining represent the training folds.

The most basic form of cross-validation is k-fold
cross-validation [4], where the data is first partitioned into k folds of
equal or nearly equal size. Subsequently, k iterations of training
and validation are performed, such that for each iteration, the
model is validated against a different fold and trained on the
remaining $k - 1$ folds, as illustrated in Fig. 5.

The next step is to determine a suitable value of k. A large
k is desirable since it yields more performance estimates.
However, it also yields a smaller validation set, leading to
less precise measurements of the performance metric. In the data
mining community, there is general consensus that k = 10 is a
good compromise between these factors, where training
on 90% of the data makes the model more likely to generalize
to the full data [24].

The results of cross-validation can yield misleadingly low
error estimates. A detailed discussion of the pitfalls of
cross-validation is found in [12]. In this study, k = 10
is used.
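
The partitioning step of k-fold cross-validation can be sketched as follows. The function name is hypothetical, and the sketch assumes the samples have already been shuffled, since it assigns contiguous index ranges to folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k folds of near-equal size."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# 25 samples split into k = 10 folds, as an illustration.
folds = list(k_fold_indices(25, 10))
```

Each sample appears in exactly one validation fold, and in every iteration the model would be trained on the remaining folds.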

Performance Measurements: Consider a binary classifier (a
predictor) that classifies each pattern in a data set into one of two
classes, either positive (P') or negative (N'), while the ground
truth is either positive (P) or negative (N). The performance of
the classifier can be represented in terms of these four possible
classification results:

True positive (TP): the result is positive (P') while the
ground truth is also positive (P)

False positive (FP): the result is positive (P') but the ground
truth is negative (N)

True negative (TN): the result is negative (N') while the
ground truth is also negative (N)

False negative (FN): the result is negative (N') but the ground
truth is positive (P)

All such symbols can also be treated as the number of
patterns that belong to each of the cases, and we have

$$P' = TP + FP, \quad N' = TN + FN, \qquad P = TP + FN, \quad N = TN + FP.$$

The four cases of the classification result can be represented
by the following 2 by 2 confusion matrix (see Fig. 6). Each
column of the matrix represents the instances in a predicted
class, while each row represents the instances in an actual
class. Thus, the diagonal entries indicate labels that were
correctly predicted, and the off-diagonal entries indicate errors.
One benefit of a confusion matrix is that it is easy to see if
the system is confusing two classes.


Fig. 6. A simple confusion matrix. 


Based on these concepts, we can further define the following
performance measurements (all expressed as fractions between 0 and
1). Sensitivity and specificity are statistical measures of the
performance of a binary classifier. Sensitivity (true positive
rate, or recall) measures the proportion of actual positives
which are correctly identified. Specificity measures the proportion
of actual negatives which are correctly identified. An ideal
classifier should have 100% sensitivity and 100% specificity.

The recall and the precision can be derived from the confusion
matrix by applying the formulas from Table I. Recall
describes the completeness of the classification. Precision
describes the proportion of predicted positives that are actually positive.


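Method | Formula
Accuracy | $ACC = \frac{TP + TN}{P + N}$
Error rate | $ERR = 1 - ACC$
Recall or True positive rate or Sensitivity | $TPR$ or $Re = \frac{TP}{P} = \frac{TP}{TP + FN}$
Precision | $Pr = \frac{TP}{P'} = \frac{TP}{TP + FP}$
True negative rate or Specificity | $TNR = \frac{TN}{N} = \frac{TN}{TN + FP}$
False positive rate | $FPR = \frac{FP}{N} = \frac{FP}{FP + TN} = 1 - TNR$
False negative rate | $FNR = \frac{FN}{P} = \frac{FN}{FN + TP} = 1 - TPR$

TABLE I

Performance measurement methods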


While recall and precision rates can be individually used to 
determine the quality of a classifier, it is often more convenient 
to have a single measure to do the same assessment. The F- 
measure combines the recall and precision rates in a single 
equation: 


$$F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$

F-measure for Cross Validation: In the previous subsection,
the general formula for calculating the F-measure was discussed.
[12] describes three different combination
strategies for cross-validation which allow different ways of
handling the F-measure, one of them being unbiased.

The first combination simply averages the F-measures.
In each fold the F-measure is recorded as $F^{(i)}$, and
the final estimate is calculated as the mean over all folds:

$$F_{avg} := \frac{1}{k} \sum_{i=1}^{k} F^{(i)}.$$

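The second combination averages precision and
recall across all the folds. Hence, the final estimate of the
F-measure can be given as follows:

$$Pr := \frac{1}{k} \sum_{i=1}^{k} Pr^{(i)}, \qquad Re := \frac{1}{k} \sum_{i=1}^{k} Re^{(i)},$$

$$F_{pr,re} := 2 \cdot \frac{Pr \cdot Re}{Pr + Re}.$$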

The third and final combination averages the
true positives, false positives and false negatives across all the folds. This
combination is also considered to be unbiased according to
the authors.

$$TP := \frac{1}{k} \sum_{i=1}^{k} TP^{(i)}, \qquad FP := \frac{1}{k} \sum_{i=1}^{k} FP^{(i)}, \qquad FN := \frac{1}{k} \sum_{i=1}^{k} FN^{(i)},$$

$$F_{tp,fp} := \frac{2 \cdot TP}{2 \cdot TP + FP + FN}.$$

On the evidence provided by [12], this study used the
unbiased F-measure ($F_{tp,fp}$) to evaluate the performance of the
k-fold cross validation.
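
The three combination strategies can be compared on per-fold counts with a short sketch. The function names and the per-fold (TP, FP, FN) values are hypothetical. The third function sums the counts instead of averaging them, which gives the same F-measure because the factor 1/k cancels in the ratio.

```python
def f_from_counts(tp, fp, fn):
    """F-measure from counts; 0.0 when the denominator is zero (assumption)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f_avg(folds):
    """Strategy 1: mean of the per-fold F-measures."""
    return sum(f_from_counts(*fold) for fold in folds) / len(folds)

def f_pr_re(folds):
    """Strategy 2: F-measure of the mean precision and mean recall."""
    pr = sum(tp / (tp + fp) if tp + fp else 0.0 for tp, fp, _ in folds) / len(folds)
    re = sum(tp / (tp + fn) if tp + fn else 0.0 for tp, _, fn in folds) / len(folds)
    return 2 * pr * re / (pr + re) if pr + re else 0.0

def f_tp_fp(folds):
    """Strategy 3: F-measure of the pooled TP, FP and FN counts."""
    tp = sum(fold[0] for fold in folds)
    fp = sum(fold[1] for fold in folds)
    fn = sum(fold[2] for fold in folds)
    return f_from_counts(tp, fp, fn)

# Hypothetical (TP, FP, FN) counts for three folds.
folds = [(8, 2, 2), (1, 0, 4), (5, 5, 0)]
estimates = (f_avg(folds), f_pr_re(folds), f_tp_fp(folds))
```

On these toy counts the three strategies give different values, which illustrates why the choice of combination strategy matters when reporting a cross-validated F-measure.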


CAUSE (REASON) | TRANSITION | EFFECT (RESULT)
She had no other options. | Consequently, | she married at thirteen.
She was not protected. | As a result, | she had a baby at thirteen.
She had no access to health education or medical clinics. | Therefore, | she was more likely to get HIV.
There was poor sanitation in the village. | As a consequence, | she had health problems.
The water was impure in her village. | For this reason, | she suffered from parasites.
She had no shoes, warm clothes or blankets. | For all these reasons, | she was often cold.
She had no resources to grow food (land, seeds, tools). | Thus, | she was hungry.
She had not been given a chance, | so | she was fighting for survival.

TABLE II

Cause (reason) and effect (result) with transition.


EFFECT (RESULT) | CONJUNCTION | CAUSE (REASON)
She married at thirteen | because | she had no other options.
She had a baby at thirteen | as | she was not protected.
She was more likely to get HIV | since | she had no access to health education.
She had health problems | because of | poor sanitation in the village.
She suffered from parasites | on account of | the impure water in her village.
She was often cold | due to | not having shoes, warm clothes or blankets.
She was hungry | for the reason that | she had no resources to grow food.
She was fighting for survival | since | she had not been given a chance.


Choice of Parameters: Most supervised learning algorithms include one or more configurable parameters. The problem is to identify suitable values for these parameters. Generally, a finite set of alternative values is defined for each parameter. Then, the simplest approach is to run the algorithm on the same training data for each combination of parameter values and to measure performance each time on the same validation set [20]. The parameters that give the best performance on the validation set are chosen.
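The exhaustive search over parameter combinations described above can be sketched as follows. The `evaluate` callback is assumed to train on the fixed training set and return a validation score; `toy_evaluate` is a hypothetical stand-in whose peak is chosen arbitrarily, not a measurement from the study.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple


def grid_search(grid: Dict[str, Iterable],
                evaluate: Callable[..., float]) -> Tuple[dict, float]:
    """Try every combination of parameter values and keep the best-scoring one.

    `evaluate(**params)` is assumed to train on the fixed training set and
    return a score (for example the F-measure) on the fixed validation set.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(C: float, sigma: float) -> float:
    """Hypothetical score surface that peaks at C=1, sigma=32 by construction."""
    return -abs(C - 1) - abs(sigma - 32)


best, score = grid_search(
    {"C": [0.01, 0.1, 1, 10, 100], "sigma": [8, 16, 32, 64, 128]},
    toy_evaluate,
)
print(best, score)
```

In practice `evaluate` would wrap the classifier training and the validation F-measure; the search itself does not depend on which classifier is used.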

B. Connectives Method 

The words which are used to connect the cause and effect in sentences are called connecting words. There is a list of approximately 200 commonly used English connecting words [9]. These words introduce a certain shift in the line of argument. The connectives method involves extracting the causal sentences using these connecting words. A connecting word is usually a transition or a conjunction [21], [48] or a verb phrase [15]. The examples in this section are taken from the grammar-quizzes¹ website.

1) Transitions: Transitions are phrases or words used to connect one idea to the next [16]. They may be "Additive", "Adversative", "Causal", or "Sequential" [74]. This study considers as transition words those words which, after a particular time, show a consequence or an effect. More detailed information regarding the transition words can be found in [9]. Table II shows the terms which serve as a transition from one sentence to the next.

¹ http://www.grammar-quizzes.com/19-2.html


TABLE III

Effect (result) and cause (reason) with conjunction.


2) Conjunctions: Conjunctions are connecting words that are often used to join two complete sentences. The conjunctions that are used to connect cause and effect sentences are 'because', 'as', 'since' and 'so'. 'Because', 'as', and 'since' introduce a cause and 'so' introduces an effect. Hence these are used to join two independent clauses together [74]. As shown in Table III, 'because' and the other conjunctions join one clause with another clause. The conjunction introduces a cause (reason) for the situation stated in the other clause.


3) Verb Phrases: A verb phrase is the part of a sentence containing the verb and an object [74]. Verb phrases can be used as connecting words to join two noun phrases, i.e. <Noun Phrase 1> <Verb Phrase> <Noun Phrase 2>. This syntactic structure serves as a causal relation, where the verb phrase acts as a causal verb or reflects a resulting effect in the object.

Table IV shows causal relations with verb phrases. Here the verb phrase introduces the effect in the cause and result expressions. Both verbs "cause" and "result" are used in the active form.

| CAUSE (REASON) | VERB PHRASE | EFFECT (RESULT) |
|---|---|---|
| Poor childhood education | causes | illiteracy. |
| Poor childhood education | results | in illiteracy. |

TABLE IV

Cause (reason) and effect (result) with verb phrases.

In Table V both verbs "cause" and "result" are used to introduce a cause. The verb "cause" may be used in the passive form with a "by" phrase. The verb "result" does not take the passive form. Instead, it is followed by a prepositional phrase with "from".

| EFFECT (RESULT) | VERB PHRASE | CAUSE (REASON) |
|---|---|---|
| Illiteracy is | caused | by poor childhood education. |
| Illiteracy | results (not "is resulted by") | from poor childhood education. |

TABLE V

Effect (result) and cause (reason) with verb phrases.

[15] extracted causal relations which included this syntactic structure. Using this method, they achieved approximately 66% recall on a test corpus generated from an archive of Los Angeles Times articles. They classified the verb phrases present in causal relations into four categories:

• Low ambiguity and high frequency (LAHF).

• Low ambiguity and low frequency (LALF).

• High ambiguity and low frequency (HALF).

• High ambiguity and high frequency (HAHF).

The verb phrases which have LAHF are as follows: "cause", "affect", "induce", "produce", "generate", "arouse", "elicit", "lead to", "trigger", "derive", "associate", "relate to", "link", "originate", "bring on", and "result".

This study concentrates only on the verb phrases "cause" and "result", since they have no ambiguity.

4) Evaluation for connectives method: In the context of the connectives method, precision and recall are defined in terms of a set of retrieved causal sentences (all the causal sentences marked by the automatic algorithm, A) and a set of relevant causal sentences (the causal sentences marked by the expert, E).

Here, precision is the fraction of retrieved causal sentences that are relevant to the expert, and recall is the fraction of expert-marked causal sentences that are successfully retrieved. It is trivial to achieve a recall of 100% by marking every sentence as causal. Therefore, recall alone is not enough; one also needs to measure the number of retrieved sentences that are not relevant according to the expert, which is what precision captures. These two measures are used together in the F-measure to provide a single measurement for a system.

Retrieved := Algorithm-marked causal sentences (A). Relevant := Expert-marked causal sentences (E).

$$Precision := \frac{|E \cap A|}{|A|}, \qquad Recall := \frac{|E \cap A|}{|E|}$$

III. Data, Processing & Representation

| Accident type | Documents |
|---|---|
| Collisions | 55 |
| Groundings | 44 |
| Machinery failures | 21 |
| Fire | 15 |
| Total | 135 |

TABLE VI

Accident types and number of reports addressed in this study.

The data used in the study are the MAIB accident investigation reports. The Marine Accident Investigation Branch (MAIB)² is a branch of the Department for Transport located in Southampton, UK. MAIB has four teams of experienced accident investigators, each comprising a principal inspector and three inspectors drawn from the nautical, engineering, naval architecture or fishing disciplines. The role of the MAIB is to contribute to safety at sea by determining the causes and circumstances of marine accidents and working with others to reduce the likelihood of such accidents recurring in the future [7].

There are 11 categories of accident investigation reports: Machinery, Fire/Explosion, Injury/Fatality, Grounding, Collision/Contact, Flooding/Foundering, Listing/Capsize, Cargo Handling Failure, Weather Damage, Hull Defects and Hazardous Incidents. This study concentrates on only four accident types, with a total of 135 investigation reports, as shown in Table VI. Each report contains about 60 pages on average, divided into three sections: 1) narrative, 2) analysis and 3) conclusions. The narrative section summarizes the accident, while every possible detail regarding the accident is analyzed in the analysis section.

A. Preprocessing 

A maritime accident investigation report is written in natural language by different investigating officers and hence does not follow a standard reporting format. This makes the investigation reports inconsistent and noisy. If the data is inconsistent, the text mining algorithms under-perform. The text also contains special formats, such as number and date formats, and very common words that are unlikely to help text mining, such as prepositions, articles, and pronouns; these are to be eliminated. In order to extract data which is consistent and accurate, data preprocessing methods are crucial.

This section of the study reviews some simple NLP processing tasks that are used in the experiments, such as tokenization and stemming using the Natural Language Toolkit (NLTK). NLTK is a suite of Python libraries and programs for symbolic and statistical natural language processing [31], [30]. NLTK includes graphical demonstrations and sample data. It is accompanied by extensive documentation.

Sometimes the data is in Portable Document Format (PDF), and processing a PDF file is difficult. Hence, conversion of the data from PDF to TXT format is crucial.

² http://www.maib.gov.uk/home/index.cfm

1) Tokenization: The aim of tokenization is to identify the words in a sentence [71]. Textual data is only a block of characters at the beginning, but all the following processes in text classification require the words of the dataset. Hence, tokenization is a prerequisite for data processing [39].

This may sound trivial as the text is already stored in machine-readable formats. Nevertheless, some problems are still left, like the removal of punctuation marks. Other characters like brackets, hyphens, etc. require processing as well. Furthermore, the text should be lower-cased to ensure consistency across the documents. The main use of tokenization is identifying the meaningful, significant words. Inconsistency can arise from different number formats or time formats. Another problem is abbreviations and acronyms, which have to be transformed into a standard form.

The following three-line program imports the tokenize package, defines a text string, and then tokenizes the string on whitespace to create a list of tokens. Here '>>>' is Python's interactive prompt; '...' is the second-level prompt.

    >>> from nltk_lite import tokenize
    >>> text = 'Hello world. This is a test.'
    >>> list(tokenize.whitespace(text))
    ['Hello', 'world.', 'This', 'is', 'a', 'test.']
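Whitespace tokenization leaves punctuation attached to the tokens ('world.'), so a normalization step is still needed. A minimal sketch using only the Python standard library is given below; the function name and the choice to treat every non-alphanumeric character as a separator are assumptions for illustration, not the procedure used in the study.

```python
import re


def normalize_tokens(text: str) -> list:
    """Lower-case the text and keep runs of letters or digits as tokens.

    Punctuation marks, brackets and hyphens act as separators and are dropped.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


tokens = normalize_tokens("Hello world. This is a (small) test!")
print(tokens)
# prints: ['hello', 'world', 'this', 'is', 'a', 'small', 'test']
```

Note that this simple rule also splits hyphenated words such as "close-quarters" into two tokens, which may or may not be desirable for a given corpus.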

2) Stop Words: In text mining, the most frequently used words, or words that do not carry any information, are known as stop words [37]. An example stop list in English is shown in Figure 7. Typically a stop list consists of about 400 to 500 such words and accounts for 20-30% of the total word count [76]. Hence, it is important to remove stop words to improve the effectiveness and efficiency of an application. For a given application, an additional domain-specific stop word list may be constructed [35]. Most researchers remove stop words using a standard stop word list. An alternative is to remove the most frequent words.

a an and are as at be by for from 

has he in is it its of on that the 

to was were will with 

Fig. 7. A stop word list of 25 semantically non-selective words which are common in the Reuters-RCV1 dataset.
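Stop word removal with the 25-word list of Fig. 7 can be sketched in a few lines. The function name and the example sentence are illustrative assumptions; the study itself used a larger, 416-word list.

```python
# The 25 stop words listed in Fig. 7.
STOP_WORDS = frozenset(
    "a an and are as at be by for from has he in is it its of on that the "
    "to was were will with".split()
)


def remove_stop_words(tokens):
    """Return the tokens that are not in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = "the fire was caused by hot debris from the hotwork".split()
print(remove_stop_words(tokens))
# prints: ['fire', 'caused', 'hot', 'debris', 'hotwork']
```

A generic list like this one would also remove connectives such as "as" or "by", so a stop list intended for causal extraction has to exclude the connecting words of interest.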

3) Stemming: Stemming refers to the process of reducing terms to their stems or root variants [42]. For example:

• agreed -> agree

• meetings, meeting -> meet

• engineering, engineered, engineer -> engine

In statistical analysis, it helps greatly when comparing texts to treat words with a common meaning and form as identical. For example, the words 'stopped' and 'stopping' stem from the same word 'stop'. Stemming identifies these common forms and reduces the computing time, as the different forms of a word are reduced to a single word. The most popular stemmer for English is Martin Porter's stemming algorithm [46], which has been shown to be effective in many cases [13], [44], [55].

The following simple code demonstrates the stemming process using NLTK:

    >>> text = 'stemming can be fun and exciting'
    >>> tokens = tokenize.whitespace(text)
    >>> porter = tokenize.PorterStemmer()
    >>> for token in tokens:
    ...     print porter.stem(token),
    stem can be fun and excit

There are a few demerits of stemming. Firstly, information about the full terms is lost. Secondly, there is a trade-off between the two main errors in stemming, i.e. 1) over-stemming and 2) under-stemming. Over-stemming occurs when two words with different stems are stemmed to the same root. This is also known as a false positive. Under-stemming happens when two words that should be stemmed to the same root are not. This is also known as a false negative. [40], [41] showed that light stemming reduces the over-stemming errors but increases the under-stemming errors. On the other hand, heavy stemmers reduce the under-stemming errors while increasing the over-stemming errors.

4) Zipf's Law: Zipf's law is the observation of [81] on the distribution of words in natural languages. It describes the word behavior in an entire corpus and can be regarded as a roughly accurate characterization of certain empirical facts. According to Zipf's law,

    frequency × rank = constant.

Suppose f(w) is the frequency of a word w in free text, where frequency is the number of times the word occurs in a corpus. If we compute the frequencies of the words in a corpus and arrange them in decreasing order of frequency, then the product of the frequency of a word and its rank (its position in the list) is more or less equal to the product of the frequency and rank of any other word. So the frequency of a word is inversely proportional to its rank. For example, the 50th most common word type should occur three times as frequently as the 150th most common word type.
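The rank-frequency relation can be checked on any token list with a short sketch. The helper names and the toy corpus are assumptions made for illustration; the corpus is constructed so that rank × frequency is exactly constant, which real text only approximates.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


def zipf_ratio(rank_a: int, rank_b: int) -> float:
    """Under an ideal Zipf law f(r) = c / r, f(rank_a) / f(rank_b) = rank_b / rank_a."""
    return rank_b / rank_a


# Hypothetical toy corpus with frequencies 6, 3 and 2.
tokens = ["the"] * 6 + ["ship"] * 3 + ["radar"] * 2
for row in rank_frequency(tokens):
    print(row)

print(zipf_ratio(50, 150))  # prints: 3.0
```

The last line reproduces the example in the text: the word at rank 50 is expected to be three times as frequent as the word at rank 150.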

Researchers [75], [11], [10], [5] used Zipf's law in experiments on large corpora. They found that a small number of words occur very often, while a large number of words occur with low frequency. Between these two extremes there are medium-frequency words as well, and it is among these medium-frequency words that the content-bearing terms are found. Common practice is to drop low-frequency words, as they have little discriminating power, while the high-frequency words are dropped using a stop word list.

5) Bag of Words Model: The Bag of Words (BoW) model is a simplified text representation used in information retrieval (IR). In this model, a text is represented as an unordered collection of words, disregarding grammar and even word order. This model is commonly used in methods of document classification, where the occurrence of each word is used as a feature for training a classifier.

As an example of text document representation based on the BoW model, consider two simple text documents:

• John likes to watch movies. Mary likes too.

• John also likes to watch football games.

Based on these two text documents, a dictionary is constructed as:

    {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5,
     "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}

which has 10 distinct words. Using the indexes of the dictionary, each document is represented by a 10-entry vector:

    [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
    [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

where each entry of a vector is the count of the corresponding dictionary entry in the document. This vector representation does not preserve the order of the words in the original sentences. This study used Zipf's law to obtain the dictionary, by removing the low-frequency (< 5) words to avoid a big feature space [37].
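The two count vectors above can be reproduced with a short sketch. The function name and the letters-only tokenization are assumptions made for illustration; the dictionary is the one given in the text.

```python
import re
from collections import Counter

# Dictionary from the example above (1-based indexes).
DICTIONARY = {
    "John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5,
    "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10,
}


def bow_vector(text: str, dictionary: dict) -> list:
    """Count how often each dictionary word occurs in the text."""
    counts = Counter(re.findall(r"[A-Za-z]+", text))
    vector = [0] * len(dictionary)
    for word, index in dictionary.items():
        vector[index - 1] = counts[word]  # shift the 1-based index to a list position
    return vector


print(bow_vector("John likes to watch movies. Mary likes too.", DICTIONARY))
# prints: [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
print(bow_vector("John also likes to watch football games.", DICTIONARY))
# prints: [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
```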

B. Document Representation 

A major challenge of the text classification problem is the representation of a document. It is the final task in document preprocessing. The documents are represented in terms of those features to which the dictionary was reduced in the preceding steps. Thus, the representation of a document is a feature vector of n elements, where n is the number of features remaining after the selection process.

When choosing a document representation, the goal is to choose features that allow document vectors belonging to different categories to occupy compact and disjoint regions in the feature space [70]. Different types of information can be extracted from documents for representation. The simplest is the Bag-of-Words (BoW) representation, in which each unique word in the training corpus is used as a term in the feature vector. A second type is categorized proper names and named entities (CAT), which uses only the tokens identified as proper names or named entities in the training corpus [77].

A comprehensive study by [3] surveys the different approaches to document representation that have been taken thus far and evaluates them on standard text classification resources. The conclusion is that more complex features do not offer any gain when combined with state-of-the-art learning methods, such as Support Vector Machines (SVM).

1) Vector Space Model: The Vector Space Model (VSM) is a classical approach applied to text documents to obtain a matrix of numbers. VSM has some severe drawbacks resulting from its main assumption, namely that texts written in natural language, which is very flexible, can be reduced to a strict mathematical representation. These problems, along with their possible solutions, are discussed in this section.

The vector space model is based on linear algebra and treats documents as vectors of numbers, containing values corresponding to the occurrence of words (also called terms) in the respective documents [51]. Let $t$ be the size of the term set and $n$ be the size of the document set. Then all documents $D_i$, $i = 1, \ldots, n$ may be represented as $t$-dimensional vectors:

$$D_i = [a_{i1}, a_{i2}, \ldots, a_{it}] \qquad (18)$$

where the coefficients $a_{ik}$ represent the value of term $k$ in document $D_i$ [51]. Thus documents and terms together form a term-document matrix $A_{(n \times t)}$. Rows of this matrix represent documents, and columns represent term vectors. Let us assume that position $a_{ik}$ is set equal to 1 when term $k$ appears in document $i$, and to 0 when it does not appear in it. For example, for the documents corresponding to a query "king", the term-document matrix can be created as shown in Table VII.

Documents set:

D1: The King University College

D2: King College Site Contents

D3: University of King College

D4: King County Bar Association

D5: King County Government Seattle Washington

D6: Martin Luther King

Terms set [T1, T2, ..., T15]: The, King, University, College, Site, Contents, of, County, Bar, Association, Government, Seattle, Washington, Martin, Luther
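A boolean term-document matrix of the kind described above can be computed directly from the six documents and fifteen terms. The sketch below is illustrative only; the function name and the whitespace tokenization are assumptions.

```python
DOCUMENTS = [
    "The King University College",
    "King College Site Contents",
    "University of King College",
    "King County Bar Association",
    "King County Government Seattle Washington",
    "Martin Luther King",
]

TERMS = [
    "The", "King", "University", "College", "Site", "Contents", "of",
    "County", "Bar", "Association", "Government", "Seattle", "Washington",
    "Martin", "Luther",
]


def boolean_matrix(documents, terms):
    """Rows are documents, columns are terms; an entry is 1 if the term occurs."""
    matrix = []
    for doc in documents:
        words = set(doc.split())
        matrix.append([1 if term in words else 0 for term in terms])
    return matrix


A = boolean_matrix(DOCUMENTS, TERMS)
for label, row in zip(("D1", "D2", "D3", "D4", "D5", "D6"), A):
    print(label, row)
```

Every row has a 1 in the "King" column, since all six documents were retrieved for that query, while most other entries are 0.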

2) Merits and Demerits of VSM: Using linear algebra as the basis of the vector space model is a merit. After transforming documents to vectors, linear algebraic operations can be easily applied. Simple, efficient data structures may be used to store the data. Representation of documents in the vector space model is very simple. However, these vectors are often sparse, i.e. most of the contained values are equal to 0. Hence, sparse vector representations can be used to save memory and time.

In the basic vector space model, only the occurrence of terms in documents is of importance and their order is not considered. This is the main reason why the approach is often criticized [79], [72], as the information about the proximity between words (their context in a sentence) is not utilized. Consider for example two documents: one containing the phrase "White House", which has a very specific meaning, and another containing the sentence "A white car was parked near the house". Treating documents simply as sets of terms, we only know that the words "white" and "house" occur in both documents, although their context is completely different. However, this problem can be easily overcome: one can supplement the model by using phrases in addition to terms in document vectors, as described in [32] and [33].


C. Term Weights 

The process of calculating the weights of terms is called term weighting. It addresses how important a term is with respect to a document (since not all terms are equally informative about the contents of the document). According to [19], term weighting is used to describe and summarize document content based on a term's importance. There are several main methods used to assign weights to terms. The simplest method is boolean term weighting, which, as its name suggests, sets weights to 0 or 1 depending on the presence of the term in the document. This method is used to calculate the term-document matrix in the example shown in Table VII. Using this method causes a loss of valuable information, as it differentiates only between two cases, the presence or absence of a term in a document, whereas the exact number of occurrences of a word may indicate its importance in the document.

TABLE VII

Term-document matrix for an example document collection.

The method utilizing knowledge of the exact number of term occurrences in documents is called TF term weighting (TF stands for Term Frequency). TF is the total count of a particular word in a document and is calculated as

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad (19)$$

where $n_{i,j}$ is the number of times the term $t_i$ occurs in document $d_j$ and the denominator is the sum of the number of times all terms occur in document $d_j$ [37].

Document Frequency (DF) is defined as the number of documents in the collection that contain the term. Inverse Document Frequency (IDF), on the other hand, is a measure of whether the term is common or rare across all documents [50]. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient:

$$idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|} \qquad (20)$$

where $|D|$ is the total number of documents in the collection and $|\{d : t_i \in d\}|$ is the number of documents in which the term $t_i$ appears [37].

TFIDF has three assumptions that, in one form or another, appear in practically all weighting methods:

• IDF assumption: "rare terms are no less important than frequent terms".

• TF assumption: "multiple appearances of a term in a document are no less important than single appearances".

• Normalization assumption: "for the same quantity of term matching, long documents are no more important than short documents".

A classical term weighting method that takes into account both term and document frequencies is called tf-idf term weighting, and it is probably the most popular approach in information retrieval systems [43], [55]. The term weight in this method is calculated as the product of its term frequency and inverse document frequency, hence its name:

$$TF\text{-}IDF_{i,j} = tf_{i,j} \times idf_i \qquad (21)$$
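Equations (19) to (21) translate into a few lines of code. The sketch below assumes tokenized documents and the natural logarithm (the text does not fix the base); the function names, the zero-weight convention for unseen terms and the toy corpus are illustrative assumptions.

```python
import math
from typing import List


def tf(term: str, doc: List[str]) -> float:
    """Equation (19): occurrences of the term divided by the document length."""
    return doc.count(term) / len(doc) if doc else 0.0


def idf(term: str, corpus: List[List[str]]) -> float:
    """Equation (20), using the natural logarithm (an assumption)."""
    containing = sum(1 for doc in corpus if term in doc)
    # A term that occurs in no document gets weight 0.0 (an assumption).
    return math.log(len(corpus) / containing) if containing else 0.0


def tf_idf(term: str, doc: List[str], corpus: List[List[str]]) -> float:
    """Equation (21): product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, corpus)


# Hypothetical toy corpus of three tokenized documents.
corpus = [
    ["king", "college"],
    ["king", "county"],
    ["martin", "luther", "king", "king"],
]
print(tf_idf("king", corpus[2], corpus))    # 0.0: "king" occurs in every document
print(tf_idf("county", corpus[1], corpus))  # 0.5 * ln(3), about 0.549
```

The first result shows the effect of the IDF factor: a term that appears in every document receives zero weight regardless of how often it is repeated.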

IV. Experiments and Results 

This section describes the procedure and results for extracting the causal relations using the pattern classification and connectives methods.

A. Pattern Classification Method

The implementation of the pattern classification method and its results are discussed in this section.


1) Dataset Collection: The dataset is the collection of causal relations marked by three domain experts. The experts marked a total of 151 causal sentences in four accident investigation reports. These 151 causal sentences, together with 151 non-causal sentences from the same accident investigation reports, form a complete dataset of 302 sentences. Of these, 70%, i.e. 212 sentences (106 causal and 106 non-causal), are used as the training set and the remaining 30%, i.e. 90 sentences (45 causal and 45 non-causal), are used as the test set.
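The split sizes quoted above follow from applying the 70% fraction to each class separately. A minimal sketch of such a class-balanced split is given below; the function name, the random shuffling, the seed and the placeholder sentences are assumptions, since the text does not say how sentences were assigned to the two sets.

```python
import random


def stratified_split(causal, non_causal, train_fraction=0.7, seed=0):
    """Split each class separately so both sets stay balanced."""
    rng = random.Random(seed)
    train, test = [], []
    for label, group in (("causal", causal), ("non-causal", non_causal)):
        items = list(group)
        rng.shuffle(items)
        cut = round(train_fraction * len(items))  # 70% of 151 rounds to 106
        train += [(sentence, label) for sentence in items[:cut]]
        test += [(sentence, label) for sentence in items[cut:]]
    return train, test


# Placeholder sentences standing in for the expert-marked data.
causal = [f"causal sentence {i}" for i in range(151)]
non_causal = [f"non-causal sentence {i}" for i in range(151)]
train, test = stratified_split(causal, non_causal)
print(len(train), len(test))  # prints: 212 90
```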

2) Data Preprocessing: The documents collected are converted from PDF to TXT format. The training data in TXT format is tokenized as explained in Section III, after which the stop words are removed. The stop word list used in the study³ contains 416 words, including single characters and excluding the transitions, conjunctions and verb phrases listed in Section II-B. Before removing stop words, the total number of terms in the 212 sentences is 6790. After removing stop words, the number of remaining terms is 3414. In the next step, stemming is performed and unique words are recorded. Words that occur 5 times or less are also removed in this process. This leaves a list of 990 significant terms, which are the final features.

3) Data Representation: Using these 990 significant terms, the training document-term matrices are constructed based on TF and TFIDF weights. Similarly, the test dataset is tokenized and stemmed using Porter's stemmer. Based on the significant terms collected from the training set, the test document-term matrices are constructed for both TF and TFIDF weights.

4) Classifiers: The training document-term matrices for both weighting schemes are divided into 10 folds, where each fold consists of 124 samples as the training set and 14 samples as the validation set (a few folds included 125:13 samples). Each fold is given as input to the classifier algorithms, viz. 1) naive Bayes classifier, 2) SVM-Linear kernel classifier and 3) SVM-Gaussian kernel classifier. The naive Bayes classifier is based on the multinomial distribution⁴, which is used for classifying count-based data such as the Bag of Words (BoW) model.

5) Parameter Tuning: The SVM-Linear kernel classifier is used with a coefficient value C = 10 and the SVM-Gaussian kernel classifier is used with a sigma value of 16. These values were selected after running the classifiers with C = {0.01, 0.1, 1, 10, 100} and sigma = {8, 16, 32, 64, 128}.

Table VIII shows the F-measure on the validation set for the SVM-Gaussian classifier for various sigma values (sigma = {8, 16, 32, 64, 128}). It can be seen that the performance is best when the sigma value is 16.

³ http://www.ranks.nl/resources/stopwords.html

⁴ http://www.mathworks.se/help/stats/naivebayes.fit.html


| Sigma | F-measure (TF) | F-measure (TFIDF) |
|---|---|---|
| 8 | 0.6885 | 0.5370 |
| 16 | 0.7826 | 0.6716 |
| 32 | 0.6817 | 0.6667 |
| 64 | 0.0000 | 0.6667 |
| 128 | 0.0000 | 0.0000 |

TABLE VIII

Parameter tuning for SVM-Gaussian kernel classifier.


| Fold | Naive Bayes | SVM-Linear | SVM-Gaussian |
|---|---|---|---|
| 1 | 0.625 | 0.3158 | 0.4444 |
| 2 | 0.7857 | 0.5455 | 0.8108 |
| 3 | 0.5455 | 0.4706 | 0.6667 |
| 4 | 0.7143 | 0.3529 | 0.7586 |
| 5 | 0.75 | 0.5556 | 0.6 |
| 6 | 0.8 | 0.25 | 0.8333 |
| 7 | 0.6667 | 0.7143 | 0.6154 |
| 8 | 0.9167 | 0.6364 | 0.9091 |
| 9 | 0.6667 | 0.4 | 0.8 |
| 10 | 0.4762 | 0.375 | 0.5556 |
| Average | 0.6947 | 0.4616 | 0.6994 |

TABLE XI

F-measure on validation sets for TFIDF weights.


Table IX shows the F-measure on the validation set for the SVM-Linear classifier for various C values (C = {0.01, 0.1, 1, 10, 100}). It can be seen that the performance of the linear kernel is best when the C value is 10.


| C | F-measure (TF) | F-measure (TFIDF) |
|---|---|---|
| 0.01 | 0.5000 | 0.5000 |
| 0.1 | 0.5000 | 0.5094 |
| 1 | 0.5566 | 0.5377 |
| 10 | 0.7604 | 0.7217 |
| 100 | 0.6132 | 0.6085 |

TABLE IX

Parameter tuning for SVM-Linear kernel classifier.


| Weights | Naive Bayes | SVM-Linear | SVM-Gaussian |
|---|---|---|---|
| TF | 0.5882 | 0.6916 | 0.7207 |
| TFIDF | 0.4941 | 0.3143 | 0.5825 |

TABLE XII

F-measure on test set.


6) Cross Validation and Testing: Table X depicts the 10-fold cross validation of the various classifiers used in this experiment with the TF weighting scheme. The results of the 10-fold cross validation are evaluated using the F-measure. The naive Bayes classifier achieved 71% F-measure across all the folds. The SVM classifiers with Gaussian and linear kernels outperformed it, with 74% and 73% F-measure respectively.

| Fold | Naive Bayes | SVM-Linear | SVM-Gaussian |
|---|---|---|---|
| 1 | 0.6667 | 0.8696 | 0.8333 |
| 2 | 0.6 | 0.7273 | 0.64 |
| 3 | 0.9231 | 0.8 | 0.8 |
| 4 | 0.8148 | 0.7407 | 0.8148 |
| 5 | 0.9167 | 0.88 | 0.9167 |
| 6 | 0.7826 | 0.6667 | 0.6667 |
| 7 | 0.8276 | 0.8 | 0.8462 |
| 8 | 0.7368 | 0.8 | 0.7619 |
| 9 | 0.3158 | 0.5 | 0.56 |
| 10 | 0.5 | 0.6 | 0.5714 |
| Average | 0.7084 | 0.7384 | 0.7411 |

TABLE X

F-measure on validation sets for TF weighting scheme.

Table XI depicts the 10-fold cross validation of the various classifiers used in this experiment with the TFIDF weighting scheme. It can be seen that the naive Bayes and SVM-Gaussian classifiers achieved 69% F-measure on average, whereas the SVM-Linear classifier achieved only 46% F-measure.

Figure 8 compares the F-measure ($F_{tp,fp}$) on the K-fold cross validation sets for both TF and TFIDF weighting schemes. It is clear that there is a marginal increase in the F-measure of the naive Bayes and SVM-Gaussian classifiers when using TF weights. The SVM-Linear classifier showed an increase of 27% when using TF weights.

Fig. 8. Comparison of F-measure ($F_{tp,fp}$) on validation sets for TF & TFIDF weights.

Table XII illustrates the performance on the test set for both TF and TFIDF weighting schemes. It is clearly seen that the SVM classifiers achieved almost 70% F-measure with TF weights.

Figure 9 illustrates the comparison of the F-measure on the test sets for both TF and TFIDF weighting schemes. There is an increase of about 10% in the F-measure of the naive Bayes and SVM-Gaussian classifiers when using TF weights, while the SVM-Linear classifier showed a significant increase of 38% when using TF weights.

Fig. 9. Comparison of F-measure on test sets for TF & TFIDF weights.

To summarize, all three classifiers achieved more than 70% F-measure across the folds using TF weights. When using TFIDF weights, the naive Bayes and SVM-Gaussian classifiers achieved 69% F-measure across the folds, while the SVM-Linear classifier achieved only 46% F-measure. A marginal increase in the F-measure is recorded for the naive Bayes and SVM-Gaussian classifiers when using TF weights. Performance on the test set shows that the SVM classifiers achieved almost 70% F-measure with TF weights. These results clearly show that the TF weighting scheme outperforms TFIDF. The possible reason for this phenomenon is discussed in section V-A.


B. Connectives Method 


In this section, the implementation procedure for the connectives method is discussed along with the results.

1) Dataset Collection: The dataset is a collection of four accident investigation reports, where each report is marked by three domain experts for causal relations. A total of 151 causal relations are marked.

2) Implementation: From the dataset, the sentences which contain connective words such as the transitions, conjunctions and verb phrases described in section II-B and listed in Table II, Table III, Table IV and Table V are extracted using the Linux command grep⁵ and then collected into a new file. A MAIB report typically consists of 60 pages, and the causal relations extracted from an accident investigation report number about 10 sentences on average. Hence a 60-page report is transformed into a half-page text document including the major contributory causes. Some example causal relations extracted from a few reports are as follows:


• Cause 1: "In assessing that Boxford was overtaking the fishing vessel, it is clear that the master misinterpreted the lights he saw. Consequently, his alteration to starboard to keep clear of Admiral Blake only served to reduce an already small CPA, thereby exacerbating the close-quarters situation."

• Cause 2: "The master did not activate Saffier's general alarm or alert the crew in any other way. Consequently they had limited warning to prepare for, or react to, the subsequent damage."

• Cause 3: "No fire detection or fire suppression systems were fitted. As a result, the fire was able to develop undetected for about minutes."

• Cause 4: "The distortion and subsequent cracking of the furnace tube in the auxiliary boiler was due to sustained overheating."

• Cause 5: "The scenario that the fire was caused when hot debris from the hotwork on the hopper came into contact with the conveyor belt."

• Cause 6: "Actions to reduce, or stop, the sheer, were insufficient to counteract the forces acting on the hull. Therefore, control of Arold was lost and a collision with the approaching Anjola ensued."
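The study performs this extraction with grep; the same idea can be sketched in Python for readers who prefer to stay in one language. The connective list below is an illustrative subset of the words in Tables II to V, and the case-insensitive whole-word matching is an assumption, not a description of the exact grep invocation used.

```python
import re

# Illustrative subset of the connectives in Tables II-V (not the full list).
CONNECTIVES = [
    "consequently", "as a result", "therefore", "as a consequence",
    "for this reason", "thus", "because", "since", "due to",
    "on account of", "caused", "causes", "results in", "results from",
]

# Longer phrases first, so multi-word connectives are tried before single words.
PATTERN = re.compile(
    r"\b(?:" + "|".join(
        re.escape(c) for c in sorted(CONNECTIVES, key=len, reverse=True)
    ) + r")\b",
    re.IGNORECASE,
)


def extract_causal_sentences(sentences):
    """Keep the sentences that contain at least one connective."""
    return [s for s in sentences if PATTERN.search(s)]


sentences = [
    "No fire detection or fire suppression systems were fitted.",
    "As a result, the fire was able to develop undetected.",
    "The vessel sailed at noon.",
    "The cracking of the furnace tube was due to sustained overheating.",
]
for s in extract_causal_sentences(sentences):
    print(s)
```

Only the second and fourth sentences are kept. When the connective is a transition, the cause sits in the preceding sentence, so a practical extractor would also need to keep that neighbouring sentence, as the quoted examples above do.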


⁵ http://linux.die.net/man/1/grep


3) Exploratory analysis: Instead of reading a whole investigation report, one could read the causal relations extracted from it to find the contributory causes of the marine accident. The causal relations extracted from one investigation report are shown below:

• In assessing that Boxford was overtaking the fishing vessel, it is clear that the master misinterpreted the lights he saw. Consequently, his alteration to starboard to keep clear of Admiral Blake only served to reduce an already small CPA, thereby exacerbating the close-quarters situation.

• However, these criticisms were at variance with the 
radar’s performance log that indicated the S-band radar 
was functioning correctly. Therefore, it is equally likely 
that the failure to detect Admiral Blake by radar was 
due to the radar’s settings not being optimized for the 
prevailing sea state and the range scale selected. 

• However, the deck cadet on Boxford did not report the 
fishing vessel’s lights until she was at about nm ahead. 
This was probably because the fishing vessel’s lights were 
only intermittently visible due to Admiral Blake’s MV 
Boxford’s view ahead partially obstructed by the uprights 
of the deck cranes, Boxford’s master was unable to detect 
Admiral Blake by radar. 

From these causal relations, it can be seen that one
contributory cause of the accident was that the radar’s
settings were not optimized for the prevailing sea state.

4) Evaluation: The evaluation is subjective, since the experts
marked the causal sentences according to their own views. In
this kind of situation, qualitative evaluation sometimes
outweighs quantitative evaluation. To qualitatively evaluate the
performance of the connectives method (the automatic algorithm),
a questionnaire was given to the domain experts. The
questionnaire and the experts’ answers are shown in Table XIII.
For quantitative evaluation, Precision and Recall are adapted
from the IR context. Here, the retrieved sentences are those
that the algorithm marked as causal, denoted by A. The relevant
sentences are those that the experts marked as causal, denoted
by E. Each symbol stands for the number of sentences in the
corresponding set. Precision is evaluated as P = (E ∩ A)/A and
Recall as R = (E ∩ A)/E. The F-measure is evaluated as
F = (2 × P × R)/(P + R).
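
As a worked illustration of these three formulas, the following minimal Python sketch computes them from the three counts. The function name precision_recall_f and its argument names are hypothetical; the input values are the Report 1 counts for expert-1 (E = 32, A = 26, E ∩ A = 13).

```python
def precision_recall_f(n_expert, n_algorithm, n_both):
    """Precision, recall and F-measure from sentence counts.

    n_expert    -- sentences the expert marked as causal (E)
    n_algorithm -- sentences the algorithm marked as causal (A)
    n_both      -- sentences marked by both (E ∩ A)
    """
    precision = n_both / n_algorithm if n_algorithm else 0.0
    recall = n_both / n_expert if n_expert else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    p, r, f = precision_recall_f(n_expert=32, n_algorithm=26, n_both=13)
    print(f"{p:.2f} {r:.2f} {f:.2f}")  # prints "0.50 0.41 0.45"
```

The guards against zero denominators are a design choice for the sketch; they do not arise for the counts reported in this study.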

Expert-1 agrees that the algorithm performs well, but notes
that some extracted passages contain non-causal information and
that the output does not sufficiently represent the safety
management related text. She also noted that the algorithm
extracted longer fractions of the text and marked some
redundant information. According to her, the algorithm found
the clearly stated sentences about the accident causes, but
missed the sentences describing various situational factors of
the accident that were not expressed in a clear causal
sentence form. Table XIV shows a total of 110 causal sentences
marked by expert-1, which are the relevant sentences. The
average values of precision, recall and F-measure for the
connectives method on the expert-1 marked reports are 0.60,
0.54 and 0.57 respectively.

Question: In what kind of situations do the automatic algorithm
and the expert agree or not agree? If so, what are they?
  Expert-1: They agree on many of the sentences, but the expert
  has also considered many more passages of text as causal
  information. Further, they especially disagree on safety
  management related text.
  Expert-2: It agrees in most cases. Especially the algorithm
  extracted the causal chains pertaining to the accidents, which
  is of the expert’s interest.
  Expert-3: The automatic algorithm and expert agree for
  important causes behind the accidents. They do not agree for
  safety policy information since that information does not have
  causal information in them, yet they are important in expert’s
  point of view.

Question: What does the algorithm find that the expert didn’t
consider?
  Expert-1: Basically the algorithm extracts longer fractions of
  the text and also some redundant information that was already
  found before in other part of the report (expert had marked
  them only once).
  Expert-2: The algorithm found almost what expert has considered
  and also some extra information but always contextual
  information is needed.
  Expert-3: The algorithm found much more information than what
  expert had marked. Expert agree that the information marked by
  the algorithm is important.

Question: What kind of sentences/expression/information the
expert found in the automatic causal relations extraction? What
are expert’s generalizations about them?
  Expert-1: The information the algorithm found was almost always
  clearly stated sentences of the investigator’s reasoning what
  might have been causing the accidents. The algorithm seems to
  find these quite well.
  Expert-2: The algorithm found the causal chains very well.
  Before reading a whole report, this algorithm could be employed
  to capture causal chains, which could make reading more easier.
  Expert-3: Useful and important causal information leading to
  the accidents was found in the automatic causal relations
  extracted. It would also be interesting to see the algorithm
  extracting the information related to safety policies.

Question: What had the expert considered important but the
algorithm did not find?
  Expert-1: Safety management related information, sentences
  which described various situational factors related to the
  accident but which were not mentioned within a clear causal
  sentence form.
  Expert-2: Very few sentences were missed out by the algorithm,
  but it works reasonably well when extracting the automatic
  causal chains.
  Expert-3: Expert considered few safety policies to be important
  which algorithm did not find. But expert understands that those
  sentences are not accurately causal.

TABLE XIII
Questionnaire and experts’ answers.


Report   E     A    E∩A   (E∩A)/A     (E∩A)/E     F-measure
1        32    26   13    0.5         0.41        0.45
2        29    27   16    0.59        0.55        0.57
3        16    15   9     0.6         0.56        0.58
4        33    30   21    0.7         0.64        0.67
Total    110   98   59    mean=0.6    mean=0.54   mean=0.57

TABLE XIV
Performance of connectives method on expert-1 marked reports.


The second expert found that the algorithm unearthed
interesting information pertaining to the accidents. The
algorithm performed as per his expectations, although in some
instances more context was needed. He stated that it would be
easier to read the report generated by the algorithm to capture
the essential information. Table XV shows a total of 81 causal
sentences marked by expert-2. The total number of causal
sentences on which both expert-2 and the algorithm agree is 61.
The average values of Precision, Recall and F-measure are 0.62,
0.75 and 0.68 respectively.


Report   E     A    E∩A   (E∩A)/A     (E∩A)/E     F-measure
1        20    26   17    0.65        0.85        0.74
2        22    27   19    0.7         0.86        0.77
3        12    15   9     0.6         0.75        0.67
4        27    30   16    0.53        0.59        0.56
Total    81    98   61    mean=0.62   mean=0.75   mean=0.68

TABLE XV
Performance of connectives method on expert-2 marked reports.


Expert-3 reiterated the views expressed by expert-1, stating
that the algorithm missed some safety policy information. He
stated that the algorithm performed better than expected in
mining causal information automatically. Table XVI shows a
total of 60 causal sentences marked by expert-3 (relevant). The
total number of causal sentences on which both expert-3 and the
algorithm agree is 40. The average values of Precision, Recall
and F-measure are 0.41, 0.67 and 0.51 respectively.


Report   E     A    E∩A   (E∩A)/A     (E∩A)/E     F-measure
1        17    26   14    0.54        0.82        0.65
2        11    27   7     0.26        0.64        0.37
3        7     15   5     0.33        0.71        0.45
4        25    30   14    0.47        0.56        0.51
Total    60    98   40    mean=0.41   mean=0.67   mean=0.51

TABLE XVI
Performance of connectives method on expert-3 marked reports.


To summarize, all the experts expressed the opinion that the
algorithm performed reasonably well, although it could be
improved when it comes to bringing safety policies to light.
Figure 10 shows that the connectives method performed best on
the expert-2 marked documents: the F-measure on the expert-2
marked reports is 68%, higher than for expert-1 (57%) and
expert-3 (51%). The average F-measure of the connectives
method is 58%.



[Figure: Precision (avg), Recall (avg) and F-measure (avg) of
the connectives method for Expert-1, Expert-2 and Expert-3,
each shown with its mean across the experts.]

Fig. 10. Comparison of Precision, Recall and F-measure for connectives
method on the experts’ marked documents.


V. Conclusions & Discussions 

The objective of this study was to extract causal relations
from maritime accident investigation reports. The data used in
this study was a collection of 302 sentences (151 causal and
151 non-causal). The training set consisted of 212 sentences
(106 causal and 106 non-causal) and the test set of 90
sentences (45 causal and 45 non-causal). To achieve the
objective, this study presented two extraction techniques:
1) the pattern classification method and 2) the connectives
method.

The pattern classification method used naive Bayes and Support
Vector Machines (SVM) as classifiers. The inputs to the
classifiers were document-term matrices, in which the documents
were the causal and non-causal sentences and the terms formed
the Bag of Words (BoW). The document-term matrices were
constructed using both TF and TFIDF weighting schemes. The
naive Bayes classifier assumed a multinomial distribution, and
the SVM classifiers used linear and Gaussian kernels. For the
SVM classifiers, parameter tuning was performed to find the
parameters that gave the best classification results.
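
To make the bag-of-words classification step concrete, here is a minimal, standard-library-only Python sketch of a multinomial naive Bayes classifier over raw term counts (TF). It is not the study's implementation, which is not described at code level here. The tokenizer, the Laplace (add-one) smoothing, the class name MultinomialNB and the four toy training sentences are all assumptions made for illustration, and the SVM and TFIDF variants are not covered.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    tokens = (w.strip(".,;:!?\"'()") for w in sentence.lower().split())
    return [t for t in tokens if t]


class MultinomialNB:
    """Multinomial naive Bayes on term counts with add-one smoothing."""

    def fit(self, sentences, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        # One bag of words (term -> count) per class.
        self.term_counts = {c: Counter() for c in self.classes}
        for sentence, label in zip(sentences, labels):
            self.term_counts[label].update(tokenize(sentence))
        self.vocab = set().union(*self.term_counts.values())
        return self

    def predict(self, sentence):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for c in self.classes:
            total_terms = sum(self.term_counts[c].values())
            score = math.log(self.class_counts[c] / total_docs)  # log prior
            for term in tokenize(sentence):
                if term not in self.vocab:
                    continue  # ignore terms never seen in training
                count = self.term_counts[c][term]
                score += math.log((count + 1) / (total_terms + len(self.vocab)))
            scores[c] = score
        return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy, made-up training data (hypothetical, not from the MAIB dataset).
    train = [
        ("The fire was caused by hot debris.", "causal"),
        ("Consequently the crew had limited warning.", "causal"),
        ("The vessel was built in 1990.", "non-causal"),
        ("The master held a valid certificate.", "non-causal"),
    ]
    model = MultinomialNB().fit([s for s, _ in train], [l for _, l in train])
    print(model.predict("The damage was caused by overheating."))
```

With only four training sentences the prediction is driven almost entirely by the word "caused", which mirrors the point made later in the discussion that connective words strongly influence the classifiers.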

In K-fold cross-validation, all three classifiers achieved an
average F-measure of more than 70% using TF weights. With the
TFIDF weighting scheme, the naive Bayes and SVM-Gaussian
classifiers achieved a 69% F-measure across the folds, while
the SVM-Linear classifier achieved only 46%. The naive Bayes
and SVM-Gaussian classifiers thus showed only a marginal
increase in F-measure when using TF weights. On the test set,
the SVM classifiers achieved an F-measure of almost 70% with
TF weights.

The connectives method was simpler to implement. The Linux
command grep extracted all the causal relations based on the
connective words listed in this study. The F-measures recorded
on the expert-1 and expert-3 marked reports are 57% and 51%
respectively. The F-measure on the expert-2 marked reports is
higher, at 68%. Hence this study shows that, using text mining
methods, causal patterns can be extracted from marine accident
investigation reports with a reasonable F-measure. Comparing
the pattern classification method (average F-measure 65%) with
the connectives method (average F-measure 58%) shows that the
pattern classification method gave a fair and sensible
performance.


A. Discussion 


The results on the test set clearly show that the TF weighting
scheme outperforms TFIDF. A high TFIDF weight is reached by a
high term frequency (in the given document) and a low document
frequency of the term in the whole collection of documents.
Hence TFIDF weights tend to filter out common terms. Since the
ratio inside the IDF’s log function is always greater than or
equal to 1, the value of IDF (and TFIDF) is greater than or
equal to 0. As a term appears in more documents, the ratio
inside the logarithm approaches 1, bringing the IDF and TFIDF
closer to 0. This study relied on very common terms, such as
transition words, conjunctions and causal verb phrases
(chapter 2, section 2.3). Because TFIDF down-weights exactly
such words, they were influential on the performance of the
classifiers using TFIDF weights.
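
The behavior of IDF described above can be checked numerically with a short Python sketch. It assumes the common unsmoothed definition IDF = log(N / df), where N is the number of documents and df the number of documents containing the term; the exact variant used in the study is not restated here, and the function names idf and tfidf are hypothetical.

```python
import math


def idf(n_documents, document_frequency):
    """Unsmoothed inverse document frequency, log(N / df), with 1 <= df <= N."""
    return math.log(n_documents / document_frequency)


def tfidf(term_frequency, n_documents, document_frequency):
    """TF-IDF weight of a term in one document."""
    return term_frequency * idf(n_documents, document_frequency)


if __name__ == "__main__":
    n = 302  # number of sentences (documents) in the dataset
    for df in (1, 30, 151, 300, 302):
        print(df, round(idf(n, df), 3))
```

The printed IDF values shrink as df grows and reach exactly 0 when the term occurs in every document, so a frequent connective receives a TFIDF weight near 0 however often it appears in a sentence.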

Machine learning studies, for example [2], reveal that if the
datasets used for training and testing a particular
classification algorithm are very similar, the apparent
performance of the predictive model may be overestimated,
reflecting the ability of the model to reproduce its input
rather than its ability to interpolate and extrapolate. Hence,
the actual level of prediction accuracy depends on the degree
of similarity between the training and test datasets, which
can explain why the performance of the different classifiers
remained relatively constant with the amount of training data.

The dataset contained 151 data samples for each class, and 70%
of the data, i.e. 212 data points, were used for training the
classifiers. With such a small amount of training data, SVM
classifiers generally produce an over-fitted or under-fitted
model. Moreover, with smaller amounts of training data, naive
Bayes, which is expected to show better performance, failed to
reach the average classification accuracies obtained by SVM.
Similarly, when 90% of the data were used for training and 10%
for validation, naive Bayes failed to compete with the SVM
models (as shown in Table X and Table XI). A possible
explanation for this behavior of the naive Bayes classifier is
redundancy in the data used for training and validating the
classifiers [2].

The most important limitation of this study is the lack of
labeled data. Although there were 135 accident investigation
reports, the analysis considered only the 4 reports that had
been marked by the experts. It is also unclear whether the
experts’ marked sentences can be treated as the "ground
truth". The labeled data is subjective, so one cannot say much
about the performance of the methods employed in the study, as
the evaluation is subjective too. In this kind of situation,
qualitative evaluation sometimes outweighs quantitative
evaluation. There also arises the question of whether an
evaluation based on these facts is reliable as such.
Nevertheless, it plays a crucial role in the performance of
the algorithms.

To conclude, the experts’ marked causal relations from four
accident investigation reports were fairly sufficient to
classify and extract causal patterns from other accident
investigation reports. The results also suggest that the use
of connecting words was influential on the classification
results. In this analysis, the pattern classification method
outperformed the connectives method. It is still unclear,
however, which of the approaches is most suitable for
extracting causal relations from maritime accident reports.
When many similar methods are available, it is difficult to
choose which one to use. In such a case, the simplicity and
reputation of the method and experience of its usage can
influence the decision. This research might lead to the
development of effective tools and methodologies for
identifying human and organizational factors present in
accident investigation reports.